library(tidyverse) # for graphing and data cleaning
library(gardenR) # for Lisa's garden data
library(lubridate) # for date manipulation
library(ggthemes) # for even more plotting themes
library(geofacet) # for special faceting with US map layout
theme_set(theme_minimal()) # My favorite ggplot() theme :)
# Lisa's garden data
data("garden_harvest")
# Seeds/plants (and other garden supply) costs
data("garden_spending")
# Planting dates and locations
data("garden_planting")
# Tidy Tuesday dog breed data
breed_traits <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_traits.csv')
trait_description <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/trait_description.csv')
breed_rank_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_rank.csv')
# Tidy Tuesday data for challenge problem
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
Before starting your assignment, you need to get yourself set up on GitHub and make sure GitHub is connected to R Studio. To do that, you should read the instructions (through the “Cloning a repo” section) and watch the video here. Then, do the following (if you get stuck on a step, don’t worry, I will help! You can always get started on the homework and we can figure out the GitHub piece later):
keep_md: TRUE in the YAML heading. The .md file is a markdown (NOT R Markdown) file that is an interim step to creating the html file. They are displayed fairly nicely in GitHub, so we want to keep it and look at it there. Click the boxes next to these two files, commit changes (remember to include a commit message), and push them (green up arrow).
Put your name at the top of the document.
For ALL graphs, you should include appropriate labels.
Feel free to change the default theme, which I currently have set to theme_minimal().
Use good coding practice. Read the short sections on good code with pipes and ggplot2. This is part of your grade!
When you are finished with ALL the exercises, uncomment the options at the top so your document looks nicer. Don’t do it before then, or else you might miss some important warnings and messages.
These exercises will reiterate what you learned in the “Expanding the data wrangling toolkit” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week (HINT: use the wday() function from lubridate). Display the results so that the vegetables are rows but the days of the week are columns.
garden_harvest %>%
mutate(wkday = wday(date, label = TRUE, abbr = TRUE)) %>%
group_by(vegetable, wkday) %>%
summarise(g_per_wday = sum(weight)) %>%
mutate( lbs_per_wday = g_per_wday * 0.00220462) %>%
ungroup() %>%
arrange(wkday) %>%
pivot_wider(id_cols = "vegetable",
names_from = "wkday",
values_from ="lbs_per_wday")
garden_harvest data to find the total harvest in pounds for each vegetable variety and then try adding the plot from the garden_planting table. This will not turn out perfectly. What is the problem? How might you fix it?
garden_harvest %>%
group_by(vegetable, variety) %>%
summarise(g = sum(weight)) %>%
mutate( lbs = g * 0.00220462) %>%
select(-"g") %>%
full_join(garden_planting %>% select(c('vegetable', 'variety', 'plot')),
by = c('vegetable', 'variety'))
garden_harvest and garden_spending datasets, along with data from somewhere like this to answer this question. You can answer this in words, referencing various join functions. You don’t need R code but could provide some if it’s helpful.
To work out how much money was saved, you first need to know how much was spent to produce a given number of pounds of each vegetable and variety. Start by converting each harvest weight to pounds and totaling the pounds harvested for each vegetable variety. Next, left join that summary to garden_spending by vegetable and variety, so that each variety has its total pounds harvested alongside the price paid (with tax). After joining, sum the price with tax and the pounds harvested for each vegetable. Then, using per-pound prices from the Whole Foods website, multiply the total pounds harvested by the price per pound to find what the harvest would have cost at the store. Finally, subtract the summed price with tax to find how much money was saved on each vegetable.
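The join steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the three small tables are made-up toy data standing in for garden_harvest, garden_spending, and a hand-built table of store prices, and the per-pound prices are hypothetical, not real Whole Foods prices.

```r
library(dplyr)

# Toy stand-in for garden_harvest (weights in grams)
harvest <- tibble(
  vegetable = c("tomatoes", "tomatoes", "beans"),
  variety   = c("grape", "grape", "Bush Bush Slender"),
  weight    = c(1000, 500, 2000)
)

# Toy stand-in for garden_spending
spending <- tibble(
  vegetable      = c("tomatoes", "beans"),
  variety        = c("grape", "Bush Bush Slender"),
  price_with_tax = c(3.00, 2.50)
)

# Hypothetical store prices per pound (assumed values for illustration)
store_prices <- tibble(
  vegetable    = c("tomatoes", "beans"),
  price_per_lb = c(4.00, 3.00)
)

savings <- harvest %>%
  # total pounds harvested per vegetable variety
  group_by(vegetable, variety) %>%
  summarise(lbs = sum(weight) * 0.00220462, .groups = "drop") %>%
  # attach what was spent on each variety
  left_join(spending, by = c("vegetable", "variety")) %>%
  # roll up to one row per vegetable
  group_by(vegetable) %>%
  summarise(lbs = sum(lbs), spent = sum(price_with_tax), .groups = "drop") %>%
  # attach the store price and compute the savings
  left_join(store_prices, by = "vegetable") %>%
  mutate(saved = lbs * price_per_lb - spent)

savings
```

With the real data, varieties that were harvested but have no matching row in garden_spending would get NA for the amount spent after the left join, so those would need to be handled before summing.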
garden_harvest %>%
filter(vegetable == 'tomatoes') %>%
group_by(variety) %>%
summarise(total_g = sum(weight),
first_day = min(date)) %>%
mutate( total_lbs = total_g * 0.00220462) %>%
ggplot(aes(y = fct_rev(fct_reorder(str_to_title(variety), first_day)), x = total_lbs)) +
geom_col(fill = "pink") + # geom_col() already draws the bars, so no separate geom_bar() is needed
geom_text(aes(label = first_day), hjust = 1.25, size = 3) + # hjust is a fixed setting, not an aesthetic
labs(title = "Tomato Variety Total Pounds by First Harvest Date", x = "Lbs", y = "")
garden_harvest data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use str_to_lower(), str_length(), and distinct().
garden_harvest %>%
mutate(lower_variety = str_to_lower(variety),
variety_length = str_length(lower_variety)) %>%
distinct(vegetable, lower_variety, variety_length) %>% # one row per vegetable variety
arrange(vegetable, variety_length) # by vegetable, then shortest name first
garden_harvest data, find all distinct vegetable varieties that have “er” or “ar” in their name. HINT: str_detect() with an “or” statement (use the | for “or”) and distinct().
garden_harvest %>%
filter(str_detect(variety, "er") | str_detect(variety, "ar") ) %>%
distinct(variety, vegetable)
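As a small aside, the | can also go inside the regular expression itself, which should give the same result as combining two str_detect() calls. The sketch below uses a made-up vector of variety names rather than the real garden_harvest data.

```r
library(stringr)

# Toy variety names, for illustration only
varieties <- c("Cherokee Purple", "grape", "Farmer's Market Blend", "Romanesco")

# Two calls joined with a logical "or"
two_calls <- str_detect(varieties, "er") | str_detect(varieties, "ar")

# One call with "or" written inside the pattern
one_call <- str_detect(varieties, "er|ar")

one_call
```

Both versions flag "Cherokee Purple" and "Farmer's Market Blend" and skip the other two names.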
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
A typical Capital Bikeshare station. This one is at Florida and California, next to Pleasant Pops.
One of the vans used to redistribute bicycles to different stations.
Two data tables are available:
Trips contains records of individual rentals
Stations gives the locations of the bike rental stations
Here is the code to read in the data. We do this a little differently than usual, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.
data_site <-
"https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations <- read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
NOTE: The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you should access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started. Make the following plots and interpret them:
sdate. Use geom_density().
Trips %>%
ggplot(aes(x = sdate)) +
geom_density(fill = "maroon4") +
labs(title = "Density of Bike Rental Events Over Time", x = "", y ="") +
theme(axis.text.y = element_blank())
Interpret: Between October and January, we see the frequency of bike rentals decrease. This makes sense because the weather gets colder and colder, so we would expect to see fewer people renting bikes in the winter.
mutate() with lubridate’s hour() and minute() functions to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.
Trips %>%
mutate(time_dec = hour(sdate) + minute(sdate) / 60) %>%
ggplot(aes(x = time_dec)) +
geom_density(fill = "maroon3") +
labs(title = "Number of Bike Rental Events per Time of Day", x = "", y ="") +
theme(axis.text.y = element_blank())
Interpret: Across the hours of the day, we see the frequency of bike rentals increase in the morning as people wake up and start going to work. There is a peak in use at around 9 am, the rush hour for getting to work. There is a dip after that as people settle into their days. There is another increase in frequency around lunch time as people rent bikes to go get lunch, and finally a last peak around 5 or 6 pm when people go home from work! There is a little increase around 1 am, which could be because of people riding home from a night out.
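To check the decimal-hour conversion from the hint on a single value, a minimal sketch like the following can help. The timestamp is a made-up example, not a row from Trips.

```r
library(lubridate)

# A made-up rental start time of 3:45 am
start_time <- ymd_hms("2014-10-15 03:45:00", tz = "UTC")

# Hour of the day plus the minutes expressed as a fraction of an hour
time_dec <- hour(start_time) + minute(start_time) / 60

time_dec
```

Here 45 minutes is three quarters of an hour, so 3:45 becomes 3.75, matching the hint.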
Trips %>%
mutate(DoW = wday(sdate, label = TRUE, abbr = TRUE)) %>%
ggplot(aes(y = fct_rev(DoW))) +
geom_bar(fill = "maroon2", col = "maroon") +
labs(title = "Number of Bike Rental Events per Day of Week", x = "", y ="")
Interpret: We see here that the largest number of rental events occur during the weekdays. This suggests to me that most people use the bike rental service to commute to their jobs. The rentals are also used on the weekends though for errands and fun things!
Trips %>%
mutate(time_dec = hour(sdate) + minute(sdate) / 60,
DoW = wday(sdate, label = TRUE, abbr = TRUE)) %>%
ggplot(aes(x = time_dec)) +
geom_density(fill = "pink") +
labs(title = "Density of Bike Rental Events per Time of Day", x = "", y ="") +
facet_wrap(~ DoW) +
theme(axis.text.y = element_blank())
Interpret: These graphs show a trend in rental frequency similar to the last few graphs. On weekdays, rentals consistently peak in the mornings and evenings when people are commuting to and from work, with a slight bump around lunch time when people get food. On weekends, there is one broad peak spread through the entire day as people are out and about doing fun things at all times of the day. We also see smaller bumps around 1 am on weekends, when people are out late more often.
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Casual). The next set of exercises investigates whether these two different categories of users show different rental behavior and how client interacts with the patterns you found in the previous exercises.
fill aesthetic for geom_density() to the client variable. You should also set alpha = .5 for transparency and color=NA to suppress the outline of the density function.
Trips %>%
mutate(time_dec = hour(sdate) + minute(sdate) / 60,
DoW = wday(sdate, label = TRUE, abbr = TRUE)) %>%
ggplot(aes(x = time_dec)) +
geom_density(aes(fill = client), color = NA, alpha = 0.5) +
labs(title = "Density of Bike Rental Events per Time of Day by Day of Week", x = "", y ="", fill = "Client") +
facet_wrap(~ DoW) +
theme(axis.text.y = element_blank())
Interpret: By specifying the type of client, we learn more about how different riders use the bike service. It appears that the Registered clients are the ones who rent bikes during the week to commute to work, as their peaks are at around 9 am and 5 pm. The Casual rider distribution doesn’t appear to vary much across the days of the week: rentals increase slowly over the afternoon on all days.
position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?
Trips %>%
mutate(time_dec = hour(sdate) + minute(sdate) / 60,
DoW = wday(sdate, label = TRUE, abbr = TRUE)) %>%
ggplot(aes(x = time_dec)) +
geom_density(aes(fill = client), color = NA, alpha = 0.5, position = position_stack()) +
labs(title = "Density of Bike Rental Events per Time of Day and Day of Week", x = "", y ="", fill = "Client") +
facet_wrap(~ DoW) +
theme(axis.text.y = element_blank())
Interpret: This is worse for storytelling because stacking draws the Casual distribution on top of the Registered distribution rather than starting it at zero again. This prevents us from seeing the true trend over time for the Casual riders, because the shape of their layer carries over the trend of the Registered riders beneath it. Stacking could be good for seeing the cumulative trend for all clients as the top layer: the Casual outline shows the same trend we see when we don’t split by client at all. So what we actually see is the share of the total trend that the Registered riders make up.
position = position_stack()). Add a new variable to the dataset called weekend which will be “weekend” if the day is Saturday or Sunday and “weekday” otherwise (HINT: use the ifelse() function and the wday() function from lubridate). Then, update the graph from the previous problem by faceting on the new weekend variable.
Trips %>%
mutate(time_dec = hour(sdate) + minute(sdate) / 60,
DoW = wday(sdate, label = TRUE, abbr = TRUE),
weekend = ifelse(DoW %in% c("Sat", "Sun"), 'weekend', 'weekday')) %>% # %in% tests membership; == would recycle the pair
ggplot(aes(x = time_dec)) +
geom_density(aes(fill = client), color = NA, alpha = 0.5) +
labs(title = "Density of Bike Rental Events per Time of Day", x = "", y ="", fill = "Client") +
facet_wrap(~ weekend) +
theme(axis.text.y = element_blank())
Interpret: Looking at weekdays, we see that Casual riders rent bikes more in the afternoons while the Registered riders rent bikes more around 8 am and 7 pm, with higher peaks. On the weekend, both Casual and Registered riders rent bikes most in the afternoon between 12 and 4, with Casual riders renting the most. We do see that Registered riders rent bikes around 1 am more than Casual riders.
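One thing to watch for when building the weekend variable: comparing a column to a two-element vector with == does not ask "is this day Sat or Sun?". R recycles the shorter vector and compares position by position, so some weekend days are silently labeled as weekdays. The sketch below shows the difference on a made-up character vector of day labels; wday(label = TRUE) returns an ordered factor, but the comparison should behave the same way.

```r
# Toy day labels, for illustration only
days <- c("Sat", "Sun", "Mon", "Sun", "Sat", "Sat")

# == recycles c("Sat", "Sun") along days and compares element by element
recycled <- days == c("Sat", "Sun")

# %in% checks each day for membership in the set
member <- days %in% c("Sat", "Sun")

# The membership test is what the weekend label needs
weekend <- ifelse(member, "weekend", "weekday")

data.frame(days, recycled, member, weekend)
```

In this toy vector the last "Sat" lines up with the recycled "Sun", so == returns FALSE for it even though it is a weekend day, while %in% returns TRUE.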
client and fill with weekday. What information does this graph tell you that the previous didn’t? Is one graph better than the other?
Trips %>%
mutate(time_dec = hour(sdate) + minute(sdate) / 60,
DoW = wday(sdate, label = TRUE, abbr = TRUE),
weekend = ifelse(DoW %in% c("Sat", "Sun"), 'weekend', 'weekday')) %>% # %in% tests membership; == would recycle the pair
ggplot(aes(x = time_dec)) +
geom_density(aes(fill = weekend), color = NA, alpha = 0.5) +
labs(title = "Number of Bike Rental Events per Time of Day", x = "", y ="", fill = "Weekend") +
facet_wrap(~ client) +
theme(axis.text.y = element_blank())
Interpret: The trends differ between weekdays and weekends more for Registered riders than for Casual riders. Casual riders appear to use bikes slightly more on weekends than on weekdays, with a sharper peak around 3 pm. We can also see how Registered riders shift the times at which they rent bikes between weekdays and weekends.
Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!
Trips %>%
group_by(sstation) %>%
count() %>%
left_join(Stations, by = c("sstation" = "name")) %>%
ggplot(aes(x = long, y = lat)) +
geom_point(aes(col = n), alpha = 0.75) +
scale_color_continuous(low = "pink", high = "maroon4") +
labs(title = "Total Number of Bike Departures in Space", col = "Departures", x = "Longitude", y = "Latitude")
Interpret: The highest numbers of rentals appear to occur in the city, in the most concentrated part of the rental space. All the points are clustered together, and it appears that people mostly use the service to get around within the city.
Trips %>%
group_by(sstation) %>%
summarise(prop_casual = sum(client == 'Casual') / n()) %>%
left_join(Stations, by = c("sstation" = "name")) %>%
ggplot(aes(x = long, y = lat)) +
geom_point(aes(col = prop_casual), alpha = 0.75) +
scale_color_continuous(low = "pink", high = "maroon4") +
labs(title = "Proportion of Bike Departures by Casual Clients in Space", col = "Proportion casual", x = "Longitude", y = "Latitude")
Interpret: The largest proportions of Casual riders appear to come from outside of the city, or appear to leave the city and come back. This helps me conclude that Casual riders go between the city and suburbs occasionally, while the Registered riders rely on the bike service to commute around the city more consistently.
DID YOU REMEMBER TO GO BACK AND CHANGE THIS SET OF EXERCISES TO THE LARGER DATASET? IF NOT, DO THAT NOW.
In this section, we’ll use the data from 2022-02-01 Tidy Tuesday. If you didn’t use that data or need a little refresher on it, see the website.
breed_traits dataset on the x-axis, with a dot for each rating. First, create a new dataset called breed_traits_total that has two variables – Breed and total_rating. The total_rating variable is the sum of the numeric ratings in the breed_traits dataset (we’ll use this dataset again in the next problem). Then, create the graph just described. Omit Breeds with a total_rating of 0 and order the Breeds from highest to lowest ranked. You may want to adjust the fig.height and fig.width arguments inside the code chunk options (eg. {r, fig.height=8, fig.width=4}) so you can see things more clearly - check this after you knit the file to assure it looks like what you expected.
breed_traits_total <-
breed_traits %>%
select(- c(`Coat Type`, `Coat Length`)) %>%
mutate(total_rating = rowSums(select(., -Breed))) %>%
filter(total_rating > 0) %>%
select(Breed, total_rating)
breed_traits_total %>%
ggplot(aes(x = total_rating, y = fct_reorder(Breed,total_rating))) +
geom_bar(stat = 'identity', fill = "darkred") +
labs(title = "Total Rating by Dog Breed", x = "", y ="")
breed_rank_all dataset). The points within each breed will be connected by a line, and the breeds should be arranged from the highest median rank to lowest median rank (“highest” is actually the smallest number, eg. 1 = best). After you’re finished, think of AT LEAST one thing you could do to make this graph better. HINTS: 1. Start with the breed_rank_all dataset and pivot it so year is a variable. 2. Use the separate() function to get year alone, and there’s an extra argument in that function that can make it numeric. 3. For both datasets used, you’ll need to str_squish() Breed before joining.
special_breeds <- breed_rank_all %>%
pivot_longer(cols = `2013 Rank`:`2020 Rank`,
names_to = "Year",
values_to = "Rank" ) %>%
separate(Year,
into = "Year",
extra = 'drop',
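convert = TRUE) %>%
mutate(Breed = str_squish(Breed)) %>% # squish Breed in both tables before joining, per the hint
inner_join(breed_traits_total %>%
mutate(Breed = str_squish(Breed)) %>%
slice_max(total_rating, n = 20), # top 20 by total_rating (ties kept, as with top_n())
by = 'Breed')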
breed_traits_total %>% mutate(Breed = str_squish(Breed)) %>% arrange(total_rating) %>% top_n(20)
special_breeds %>%
ggplot(aes(y = fct_reorder(Breed, total_rating), x = Year, color = Rank)) +
geom_point() +
geom_line(aes(group =Breed)) +
labs(title = 'Top 20 Dog Breeds Ranked Across Time', x = "", y ="" ) +
scale_color_continuous(low = "maroon4", high = "white")
The lines connecting the points within each breed do not convey much here. To improve the graph, I would show only the top 10 breeds, map color to Breed, and put Rank on the y-axis so that each breed’s rank can be followed over time along the x-axis.
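The improvement described above might look something like the following sketch. The breed names and ranks are made-up toy values, not the real breed_rank_all data, and the reversed y-axis is just one possible choice for putting rank 1 at the top.

```r
library(ggplot2)

# Made-up ranks for three placeholder breeds, for illustration only
toy_ranks <- data.frame(
  Breed = rep(c("Breed A", "Breed B", "Breed C"), each = 3),
  Year  = rep(2018:2020, times = 3),
  Rank  = c(1, 1, 2, 3, 2, 1, 2, 3, 3)
)

rank_plot <- ggplot(toy_ranks, aes(x = Year, y = Rank, color = Breed)) +
  geom_line() +                            # one line per breed, grouped by color
  geom_point() +
  scale_y_reverse(breaks = 1:3) +          # rank 1 (best) at the top
  scale_x_continuous(breaks = 2018:2020) +
  labs(title = "Toy Breed Ranks Over Time", x = "", y = "Rank", color = "Breed")

rank_plot
```

With rank on the y-axis, the lines now carry meaning: a line that rises shows a breed gaining popularity, and crossings show breeds swapping places.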
join or pivot function (or both, if you’d like), a str_XXX() function, and a fct_XXX() function to create a graph using any of the dog datasets. One suggestion is to try to improve the graph you created for the Tidy Tuesday assignment. If you want an extra challenge, find a way to use the dog images in the breed_rank_all file - check out the ggimage library and this resource for putting images as labels.
breed_rank_all %>%
pivot_longer(cols = `2013 Rank`:`2020 Rank`,
names_to = "Year",
values_to = "Rank" ) %>%
separate(Year,
into = "Year",
extra = 'drop',
convert = TRUE) %>%
inner_join(breed_traits %>%
filter(`Drooling Level` == 5) %>% # the rating is numeric, so compare with a number
mutate(Breed = str_squish(Breed)),
by = 'Breed') %>%
ggplot(aes(color = fct_rev(fct_reorder(Breed, Rank)), x = Year, y = Rank))+
geom_point() +
geom_line(aes(group = Breed)) +
labs(title = "Ranks Across Time of Dog Breeds that Drool the Most", x = "", y ="", color = 'Breed')
Here we see that out of the 5 dogs that got a 5 score on ‘Drooling Level’, Newfoundlands are the most highly ranked and are the best while Neapolitan Mastiffs are the lowest ranked.
This problem uses the data from the Tidy Tuesday competition this week, kids. If you need to refresh your memory on the data, read about it here.
facet_geo(). The graphic won’t load below since it came from a location on my computer. So, you’ll have to reference the original html on the moodle page to see it.
DID YOU REMEMBER TO UNCOMMENT THE OPTIONS AT THE TOP?